Mutual Information Score (mutual_info_score)#
mutual_info_score measures the mutual information (MI) between two discrete labelings (often: two clusterings).
Think of it as: “How much does knowing clustering A reduce uncertainty about clustering B?”
Goals
Define MI using entropy and the joint distribution
Compute MI from a contingency table (counts)
Implement mutual_info_score from scratch in NumPy (and verify vs scikit-learn)
Visualize how MI behaves under label permutation, random independence, and label noise
Use MI as a non-differentiable objective to tune a simple logistic-regression decision threshold
Understand pros/cons and when to prefer NMI/AMI
Quick import (scikit-learn)
from sklearn.metrics import mutual_info_score
Units: scikit-learn uses the natural log, so MI is measured in nats (use log2 for bits).
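For example, a score in nats converts to bits by dividing by \(\ln 2\) (a tiny self-contained check):
from sklearn.metrics import mutual_info_score
import numpy as np
mi_nats = mutual_info_score([0, 0, 1, 1], [0, 0, 1, 1])  # two perfectly matched binary labelings: ln(2) nats
mi_bits = mi_nats / np.log(2)  # nats -> bits
print(f"{mi_nats:.4f} nats = {mi_bits:.4f} bits")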
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.datasets import make_classification
from sklearn.metrics import (
adjusted_mutual_info_score,
mutual_info_score,
normalized_mutual_info_score,
)
from sklearn.model_selection import train_test_split
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) What does mutual_info_score measure?#
We observe two label arrays of length \(n\):
labels_true: a partition \(U\) (e.g., ground-truth classes or clustering A)
labels_pred: a partition \(V\) (e.g., predicted clusters or clustering B)
Mutual information \(I(U;V)\) is:
0 when \(U\) and \(V\) are independent (knowing one tells you nothing about the other)
large when \(U\) and \(V\) are strongly dependent (knowing one reveals a lot about the other)
In clustering, MI is popular because it is permutation-invariant: renaming clusters (0 ↔ 2, etc.) does not change the score.
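A quick check of permutation invariance, using mutual_info_score from the imports below: relabeling the same partition leaves MI unchanged.
a = [0, 0, 1, 1, 2, 2]
b = [2, 2, 0, 0, 1, 1]  # same partition as a, with cluster ids renamed
print(mutual_info_score(a, a))  # ln(3) ≈ 1.0986 nats
print(mutual_info_score(a, b))  # identical score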
2) Mutual information for discrete variables#
Let \(U\) and \(V\) be discrete random variables with joint pmf \(p(u,v)\) and marginals \(p(u)\), \(p(v)\).
Entropy (uncertainty)#
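\[
H(U) = -\sum_{u} p(u)\,\log p(u), \qquad H(V) = -\sum_{v} p(v)\,\log p(v)
\]
Mutual information (shared information)#
Mutual information is the reduction in uncertainty about one variable from observing the other:
\[
I(U;V) = \sum_{u}\sum_{v} p(u,v)\,\log\frac{p(u,v)}{p(u)\,p(v)} = H(U) + H(V) - H(U,V)
\]
It is symmetric, non-negative, and equal to \(0\) exactly when \(U\) and \(V\) are independent.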
3) From label arrays to MI: the contingency matrix#
Given \(n\) paired observations \((u_i, v_i)\), define the contingency table counts:
\(n_{ij}\) = number of samples with \(u=i\) and \(v=j\)
\(n_{i*} = \sum_j n_{ij}\) (row sums)
\(n_{*j} = \sum_i n_{ij}\) (column sums)
\(n = \sum_{i,j} n_{ij}\) (total)
A common “plug-in” estimator replaces probabilities by empirical frequencies:
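\[
\hat{I}(U;V) \;=\; \sum_{i}\sum_{j} \frac{n_{ij}}{n}\,\log\!\left(\frac{n_{ij}/n}{(n_{i*}/n)\,(n_{*j}/n)}\right) \;=\; \sum_{i}\sum_{j} \frac{n_{ij}}{n}\,\log\!\left(\frac{n_{ij}\, n}{n_{i*}\, n_{*j}}\right)
\]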
By convention, terms with \(n_{ij}=0\) contribute \(0\) (since \(0\log 0 := 0\)).
This is what sklearn.metrics.mutual_info_score computes (using the natural log).
def contingency_matrix_numpy(labels_true, labels_pred):
"""Build the contingency matrix N where N[i, j] counts samples with
true label i and predicted label j (after re-indexing labels to 0..K-1).
"""
labels_true = np.asarray(labels_true)
labels_pred = np.asarray(labels_pred)
if labels_true.shape[0] != labels_pred.shape[0]:
raise ValueError("labels_true and labels_pred must have the same length")
true_values, true_inv = np.unique(labels_true, return_inverse=True)
pred_values, pred_inv = np.unique(labels_pred, return_inverse=True)
n_true = true_values.shape[0]
n_pred = pred_values.shape[0]
flat = true_inv * n_pred + pred_inv
counts = np.bincount(flat, minlength=n_true * n_pred)
contingency = counts.reshape(n_true, n_pred)
return contingency, true_values, pred_values
def mutual_info_score_numpy(labels_true, labels_pred):
"""NumPy implementation of mutual information between two labelings.
Matches sklearn.metrics.mutual_info_score (natural log, result in nats).
"""
contingency, _, _ = contingency_matrix_numpy(labels_true, labels_pred)
contingency = contingency.astype(float)
n = contingency.sum()
if n == 0:
return 0.0
row_sum = contingency.sum(axis=1)
col_sum = contingency.sum(axis=0)
i_idx, j_idx = np.nonzero(contingency)
nij = contingency[i_idx, j_idx]
mi = (nij / n) * np.log((nij * n) / (row_sum[i_idx] * col_sum[j_idx]))
return float(mi.sum())
# Quick verification vs scikit-learn
for k_true, k_pred in [(3, 3), (4, 2), (10, 10)]:
labels_true_rand = rng.integers(0, k_true, size=500)
labels_pred_rand = rng.integers(0, k_pred, size=500)
mi_sklearn = mutual_info_score(labels_true_rand, labels_pred_rand)
mi_numpy = mutual_info_score_numpy(labels_true_rand, labels_pred_rand)
print(
f"k_true={k_true:>2}, k_pred={k_pred:>2} | sklearn={mi_sklearn:.6f} numpy={mi_numpy:.6f} | diff={abs(mi_sklearn - mi_numpy):.2e}"
)
k_true= 3, k_pred= 3 | sklearn=0.003990 numpy=0.003990 | diff=2.64e-16
k_true= 4, k_pred= 2 | sklearn=0.004427 numpy=0.004427 | diff=5.85e-16
k_true=10, k_pred=10 | sklearn=0.079174 numpy=0.079174 | diff=1.94e-16
4) Intuition: permutation, independence, and noise#
We’ll build a small toy labeling with 3 clusters and compare three predicted labelings:
Perfect but permuted: same partition, different label ids (MI should be maximal)
Independent random labels: no relationship (MI near 0)
Noisy labels: flip some fraction of labels at random (MI decreases with noise)
n_per_cluster = 150
labels_true = np.repeat(np.arange(3), n_per_cluster)
rng.shuffle(labels_true)
# 1) perfect but permuted
perm = np.array([2, 0, 1])
labels_perm = perm[labels_true]
# 2) independent random labels (same number of clusters)
labels_random = rng.integers(0, 3, size=labels_true.shape[0])
# 3) noisy labels: with prob p, replace by a random label
def add_label_noise(labels, n_labels, p_noise, rng):
labels = np.asarray(labels).copy()
mask = rng.random(labels.shape[0]) < p_noise
labels[mask] = rng.integers(0, n_labels, size=mask.sum())
return labels
labels_noisy = add_label_noise(labels_true, n_labels=3, p_noise=0.3, rng=rng)
for name, lab in [
("perfect (permuted)", labels_perm),
("random (independent)", labels_random),
("noisy (p=0.3)", labels_noisy),
]:
print(f"{name:>18}: MI={mutual_info_score_numpy(labels_true, lab):.4f} nats")
perfect (permuted): MI=1.0986 nats
random (independent): MI=0.0008 nats
noisy (p=0.3): MI=0.4862 nats
def contingency_heatmap_trace(contingency, true_vals, pred_vals, showscale=False):
contingency = np.asarray(contingency)
return go.Heatmap(
z=contingency,
x=[str(v) for v in pred_vals],
y=[str(v) for v in true_vals],
colorscale="Blues",
showscale=showscale,
text=contingency,
texttemplate="%{text}",
hovertemplate="true=%{y}<br>pred=%{x}<br>count=%{z}<extra></extra>",
)
cont_perm, true_vals, pred_perm_vals = contingency_matrix_numpy(labels_true, labels_perm)
cont_rand, _, pred_rand_vals = contingency_matrix_numpy(labels_true, labels_random)
cont_noisy, _, pred_noisy_vals = contingency_matrix_numpy(labels_true, labels_noisy)
fig = make_subplots(
rows=1,
cols=3,
subplot_titles=[
"Perfect (permuted)",
"Random (independent)",
"Noisy (p=0.3)",
],
)
fig.add_trace(contingency_heatmap_trace(cont_perm, true_vals, pred_perm_vals), row=1, col=1)
fig.add_trace(contingency_heatmap_trace(cont_rand, true_vals, pred_rand_vals), row=1, col=2)
fig.add_trace(contingency_heatmap_trace(cont_noisy, true_vals, pred_noisy_vals), row=1, col=3)
fig.update_xaxes(title_text="pred label", row=1, col=1)
fig.update_xaxes(title_text="pred label", row=1, col=2)
fig.update_xaxes(title_text="pred label", row=1, col=3)
fig.update_yaxes(title_text="true label", row=1, col=1)
fig.update_layout(title="Contingency tables (counts)", height=350)
fig.show()
A quick read:
A table with one dominant entry per row and column (diagonal up to a relabeling) means each true cluster mostly maps to one predicted cluster ⇒ high MI.
A near-uniform table means the predicted labels don't depend on the true labels ⇒ MI near 0.
ps = np.linspace(0.0, 1.0, 41)
n_repeats = 30
mi_means = []
mi_stds = []
for p_noise in ps:
vals = []
for _ in range(n_repeats):
lab = add_label_noise(labels_true, n_labels=3, p_noise=float(p_noise), rng=rng)
vals.append(mutual_info_score_numpy(labels_true, lab))
vals = np.array(vals)
mi_means.append(vals.mean())
mi_stds.append(vals.std())
mi_means = np.array(mi_means)
mi_stds = np.array(mi_stds)
fig = go.Figure()
fig.add_trace(go.Scatter(x=ps, y=mi_means, mode="lines", name="mean MI"))
fig.add_trace(
go.Scatter(
x=np.r_[ps, ps[::-1]],
y=np.r_[mi_means - mi_stds, (mi_means + mi_stds)[::-1]],
fill="toself",
fillcolor="rgba(0,0,0,0.10)",
line=dict(color="rgba(0,0,0,0)"),
hoverinfo="skip",
name="±1 std",
)
)
max_mi = np.log(3)
fig.add_hline(y=max_mi, line_dash="dash", line_color="gray", annotation_text="log(3) max")
fig.update_layout(
title="Mutual information vs label noise",
xaxis_title="noise probability p (replace label with random label)",
yaxis_title="MI (nats)",
)
fig.show()
5) What MI is “adding up”: observed vs expected co-occurrences#
Each contingency cell \((i,j)\) compares:
observed count: \(n_{ij}\)
expected count under independence: \(\frac{n_{i*} n_{*j}}{n}\)
The per-cell contribution is:
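\[
c_{ij} \;=\; \frac{n_{ij}}{n}\,\log\!\left(\frac{n_{ij}}{n_{i*}\, n_{*j}/n}\right) \;=\; \frac{n_{ij}}{n}\,\log\!\left(\frac{n_{ij}\, n}{n_{i*}\, n_{*j}}\right)
\]
It is positive where a cell is over-represented relative to independence, negative where it is under-represented, and the contributions sum to \(\hat{I}(U;V)\).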
Let’s visualize those contributions.
def mi_contribution_matrix(contingency):
contingency = np.asarray(contingency, dtype=float)
n = contingency.sum()
if n == 0:
return np.zeros_like(contingency, dtype=float)
row_sum = contingency.sum(axis=1, keepdims=True)
col_sum = contingency.sum(axis=0, keepdims=True)
denom = row_sum * col_sum
ratio = np.ones_like(contingency, dtype=float)
np.divide(contingency * n, denom, out=ratio, where=(contingency > 0))
contrib = np.zeros_like(contingency, dtype=float)
mask = contingency > 0
contrib[mask] = (contingency[mask] / n) * np.log(ratio[mask])
return contrib
contrib_noisy = mi_contribution_matrix(cont_noisy)
fig = px.imshow(
contrib_noisy,
text_auto=".3f",
aspect="auto",
color_continuous_scale="RdBu",
color_continuous_midpoint=0.0,
labels=dict(x="pred label", y="true label", color="contrib (nats)"),
x=[str(v) for v in pred_noisy_vals],
y=[str(v) for v in true_vals],
title="Per-cell MI contributions (noisy example, p=0.3)",
)
fig.show()
print(f"Sum of contributions = {contrib_noisy.sum():.4f} nats (should equal MI)")
Sum of contributions = 0.4862 nats (should equal MI)
6) Using MI in a simple optimization loop: tune a logistic-regression threshold#
mutual_info_score needs discrete labels. For a probabilistic classifier (like logistic regression) we usually have scores/probabilities \(\hat{p}(x)\).
A common way to use MI is as a validation objective over a non-differentiable choice, e.g. the decision threshold \(\tau\):
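\[
\hat{y}_i(\tau) = \mathbf{1}\big[\hat{p}(x_i) \ge \tau\big], \qquad \tau^{*} = \arg\max_{\tau}\; I\big(y;\, \hat{y}(\tau)\big)
\]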
We’ll:
fit logistic regression from scratch (gradient descent on log-loss),
sweep thresholds on a validation set,
pick the threshold that maximizes MI.
X, y = make_classification(
n_samples=1600,
n_features=2,
n_informative=2,
n_redundant=0,
n_clusters_per_class=1,
weights=[0.8, 0.2],
class_sep=1.2,
random_state=42,
)
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=0.35, random_state=42, stratify=y
)
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_s = (X_train - mean) / std
X_val_s = (X_val - mean) / std
# add intercept column
X_train_b = np.c_[np.ones(X_train_s.shape[0]), X_train_s]
X_val_b = np.c_[np.ones(X_val_s.shape[0]), X_val_s]
fig = px.scatter(
x=X_val_s[:, 0],
y=X_val_s[:, 1],
color=y_val.astype(int),
title="Validation data (standardized features)",
labels={"x": "x1 (standardized)", "y": "x2 (standardized)", "color": "class"},
)
fig.show()
def sigmoid(z):
z = np.clip(z, -50, 50) # numerical stability
return 1.0 / (1.0 + np.exp(-z))
def fit_logistic_regression_gd(X, y, lr=0.2, n_iter=2500, l2=0.0):
"""Binary logistic regression with gradient descent on log-loss.
X: (n, d) including intercept column if desired.
y: (n,) in {0,1}.
l2: L2 regularization strength (applied to weights except intercept).
"""
X = np.asarray(X, dtype=float)
y = np.asarray(y, dtype=float)
n, d = X.shape
w = np.zeros(d, dtype=float)
eps = 1e-12
losses = np.empty(n_iter, dtype=float)
for t in range(n_iter):
p = sigmoid(X @ w)
loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
loss += 0.5 * l2 * np.sum(w[1:] ** 2)
losses[t] = loss
grad = (X.T @ (p - y)) / n
grad[1:] += l2 * w[1:]
w -= lr * grad
return w, losses
w, losses = fit_logistic_regression_gd(X_train_b, y_train, lr=0.2, n_iter=2500, l2=0.1)
print("w =", w)
fig = px.line(
y=losses,
title="Training curve (log-loss)",
labels={"x": "iteration", "y": "log-loss"},
)
fig.show()
w = [-1.6346 1.0995 0.1196]
def accuracy(y_true, y_pred):
y_true = np.asarray(y_true).astype(int)
y_pred = np.asarray(y_pred).astype(int)
return float(np.mean(y_true == y_pred))
def confusion_counts(y_true, y_pred):
y_true = np.asarray(y_true).astype(int)
y_pred = np.asarray(y_pred).astype(int)
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
return tp, fp, tn, fn
p_val = sigmoid(X_val_b @ w)
taus = np.linspace(0.01, 0.99, 99)
mi_vals = np.empty_like(taus)
acc_vals = np.empty_like(taus)
for i, tau in enumerate(taus):
y_pred_tau = (p_val >= tau).astype(int)
mi_vals[i] = mutual_info_score_numpy(y_val, y_pred_tau)
acc_vals[i] = accuracy(y_val, y_pred_tau)
best_i = int(np.argmax(mi_vals))
best_tau = float(taus[best_i])
y_pred_best = (p_val >= best_tau).astype(int)
y_pred_05 = (p_val >= 0.5).astype(int)
print(f"Best threshold (by MI): tau* = {best_tau:.2f}")
print(f"MI(tau*) = {mutual_info_score_numpy(y_val, y_pred_best):.4f} nats")
print(f"MI(0.50) = {mutual_info_score_numpy(y_val, y_pred_05):.4f} nats")
print(f"Acc(tau*) = {accuracy(y_val, y_pred_best):.3f}")
print(f"Acc(0.50) = {accuracy(y_val, y_pred_05):.3f}")
tp, fp, tn, fn = confusion_counts(y_val, y_pred_best)
print(f"Confusion@tau*: TP={tp} FP={fp} TN={tn} FN={fn}")
Best threshold (by MI): tau* = 0.36
MI(tau*) = 0.3175 nats
MI(0.50) = 0.2191 nats
Acc(tau*) = 0.954
Acc(0.50) = 0.914
Confusion@tau*: TP=92 FP=4 TN=442 FN=22
fig = go.Figure()
fig.add_trace(go.Scatter(x=taus, y=mi_vals, mode="lines", name="MI (nats)"))
fig.add_trace(go.Scatter(x=taus, y=acc_vals, mode="lines", name="Accuracy", yaxis="y2"))
fig.add_vline(x=best_tau, line_dash="dash", line_color="black", annotation_text="tau*")
fig.add_vline(x=0.5, line_dash="dot", line_color="gray", annotation_text="0.5")
fig.update_layout(
title="Threshold sweep on validation set",
xaxis_title="threshold tau",
yaxis=dict(title="MI (nats)"),
yaxis2=dict(title="Accuracy", overlaying="y", side="right", range=[0, 1]),
legend=dict(x=0.01, y=0.99),
)
fig.show()
def logit(p):
p = np.clip(p, 1e-12, 1 - 1e-12)
return np.log(p / (1 - p))
def decision_boundary_x2(w, tau, x1_grid):
# w0 + w1*x1 + w2*x2 = logit(tau)
w0, w1, w2 = w
if abs(w2) < 1e-12:
raise ValueError("w2 is ~0; boundary would be vertical in (x1,x2)")
return (logit(tau) - w0 - w1 * x1_grid) / w2
x1_min, x1_max = X_val_s[:, 0].min() - 0.5, X_val_s[:, 0].max() + 0.5
x1_grid = np.linspace(x1_min, x1_max, 200)
x2_tau05 = decision_boundary_x2(w, tau=0.5, x1_grid=x1_grid)
x2_best = decision_boundary_x2(w, tau=best_tau, x1_grid=x1_grid)
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=X_val_s[:, 0],
y=X_val_s[:, 1],
mode="markers",
marker=dict(
color=y_val.astype(int),
colorscale="Viridis",
showscale=True,
colorbar=dict(title="class"),
size=6,
opacity=0.8,
),
name="val points",
)
)
fig.add_trace(
go.Scatter(
x=x1_grid,
y=x2_tau05,
mode="lines",
line=dict(color="black", dash="dot", width=3),
name="boundary (tau=0.5)",
)
)
fig.add_trace(
go.Scatter(
x=x1_grid,
y=x2_best,
mode="lines",
line=dict(color="crimson", dash="dash", width=3),
name=f"boundary (tau*={best_tau:.2f})",
)
)
fig.update_layout(
title="Decision boundary shifts when you change the threshold",
xaxis_title="x1 (standardized)",
yaxis_title="x2 (standardized)",
height=450,
)
fig.show()
Bonus: MI is a non-smooth objective (toy grid search)#
If we hard-threshold predictions, MI changes only when the decision boundary flips some points. That makes \(I(Y;\hat{Y})\) a piecewise-constant function of model parameters.
To make this visible, we’ll take a 1D linear classifier on the validation set:
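\[
\hat{y}_i = \mathbf{1}\big[w_0 + w_1\, x_{i,1} \ge 0\big]
\]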
and grid-search \((w_0, w_1)\) to maximize MI.
x1 = X_val_s[:, 0]
y_bin = y_val.astype(int)
w0_grid = np.linspace(-4.0, 4.0, 81)
w1_grid = np.linspace(-4.0, 4.0, 81)
mi_grid = np.empty((w0_grid.size, w1_grid.size), dtype=float)
for i, w0 in enumerate(w0_grid):
for j, w1 in enumerate(w1_grid):
y_pred = (w0 + w1 * x1 >= 0).astype(int) # tau=0.5 boundary (logit=0)
mi_grid[i, j] = mutual_info_score_numpy(y_bin, y_pred)
best_idx = np.unravel_index(np.argmax(mi_grid), mi_grid.shape)
w0_best = float(w0_grid[best_idx[0]])
w1_best = float(w1_grid[best_idx[1]])
print(
f"Best (w0,w1) on this grid: ({w0_best:.2f}, {w1_best:.2f}) | MI={mi_grid[best_idx]:.4f} nats"
)
fig = px.imshow(
mi_grid,
x=w1_grid,
y=w0_grid,
origin="lower",
aspect="auto",
color_continuous_scale="Viridis",
labels=dict(x="w1", y="w0", color="MI (nats)"),
title="MI landscape for a 1D hard-threshold classifier (piecewise constant)",
)
fig.add_scatter(
x=[w1_best],
y=[w0_best],
mode="markers",
marker=dict(color="red", size=10),
name="best",
)
fig.show()
Best (w0,w1) on this grid: (1.60, -1.50) | MI=0.2973 nats
Takeaway: MI can be used to tune discrete choices (thresholds, model selection, number of clusters) via search. With hard predictions it becomes step-like (non-smooth) in typical model parameters, so it is rarely used as a direct gradient-based training loss.
7) Pitfalls (and why NMI/AMI exist)#
Two common gotchas:
MI is not normalized: values depend on the entropies of the labelings (number of clusters and their balance). Comparing MI across datasets can be misleading.
MI does not penalize over-segmentation: a clustering with many tiny clusters can still have a high MI with the ground truth.
Let’s see the second pitfall.
labels_true_small = labels_true
labels_unique = np.arange(labels_true_small.shape[0]) # each sample its own cluster
for name, pred in [
("perfect (permuted)", labels_perm),
("unique cluster per sample", labels_unique),
("random labels (3 clusters)", labels_random),
]:
mi = mutual_info_score(labels_true_small, pred)
nmi = normalized_mutual_info_score(labels_true_small, pred)
ami = adjusted_mutual_info_score(labels_true_small, pred)
print(f"{name:>24}: MI={mi:.4f} | NMI={nmi:.4f} | AMI={ami:.4f}")
perfect (permuted): MI=1.0986 | NMI=1.0000 | AMI=1.0000
unique cluster per sample: MI=1.0986 | NMI=0.3048 | AMI=-0.0000
random labels (3 clusters): MI=0.0008 | NMI=0.0007 | AMI=-0.0034
8) Pros, cons, and where MI shines#
Pros
Permutation-invariant: relabeling clusters doesn’t change the score.
Captures general dependence: not limited to linear relationships (for discrete variables).
Uses full contingency table: considers all overlaps, not only “matching” pairs.
Cons / caveats
Unbounded / not directly interpretable across problems (depends on label entropies).
Can reward over-segmentation (many clusters) → consider normalized_mutual_info_score or adjusted_mutual_info_score.
Requires discrete variables: continuous features need discretization or different MI estimators.
Non-differentiable for hard labels: not a convenient loss for gradient-based training; better as a validation metric / search objective.
Common use cases
External clustering evaluation when ground truth labels exist.
Comparing two different clusterings/segmentations of the same data.
Feature selection (MI between a discrete target and a discretized feature).
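As a minimal sketch of the feature-selection idea (reusing the imports above; mi_with_target, the quantile binning, and n_bins=8 are illustrative choices, not a prescribed recipe):
X_feat, y_feat = make_classification(n_samples=1000, n_features=5, n_informative=2, random_state=0)

def mi_with_target(feature, target, n_bins=8):
    # Discretize a continuous feature into quantile bins, then score it against the target with MI.
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])  # interior bin edges
    binned = np.digitize(feature, edges)
    return mutual_info_score(target, binned)

scores = [mi_with_target(X_feat[:, j], y_feat) for j in range(X_feat.shape[1])]
print(np.round(scores, 4))  # higher MI suggests a more informative (discretized) feature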
Exercises#
Implement normalized mutual information (NMI) from scratch using entropies.
Show how MI changes when clusters are highly imbalanced (e.g., 95/5 split).
Use MI to select the best of several clustering hyperparameters (e.g., choose \(k\) in k-means when ground truth exists).
References#
scikit-learn: sklearn.metrics.mutual_info_score
Cover & Thomas, Elements of Information Theory (entropy, mutual information)